Introduction to Pandas

As a personal preference, I believe it is better not to import all the functions into the current namespace.


In [5]:
import numpy as np
import pandas as pd

Pandas has historically offered 3 types of data structures.

Data Structure    Dimensions
Series            1-Dim
DataFrame         2-Dim
Panel             3-Dim

We will be dealing with Series and DataFrames. We will not be handling Panel here; it was deprecated and then removed in pandas 0.25.

All data structures have both list-like and dict-like properties.

Series

In its simplest form, a Series can be created from a dict.


In [4]:
data = {'Mon':'Monday',
        'Tues':'Tuesday',
        'Wed':'Wednesday',
        'Thurs':'Thursday',
        }
s = pd.Series(data)
s


Out[4]:
Mon         Monday
Thurs     Thursday
Tues       Tuesday
Wed      Wednesday
dtype: object

In [5]:
s.index


Out[5]:
Index([u'Mon', u'Thurs', u'Tues', u'Wed'], dtype='object')
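
The output above lists the keys alphabetically, which is how older pandas releases ordered dict keys; recent releases keep the dict's insertion order instead. As a minimal sketch (reusing the same dict), the order can be controlled explicitly either by sorting the index or by passing an index argument. The label 'Fri' is an illustrative addition that is deliberately missing from the dict.

```python
import pandas as pd

data = {'Mon': 'Monday',
        'Tues': 'Tuesday',
        'Wed': 'Wednesday',
        'Thurs': 'Thursday'}

# sort_index() gives the alphabetical label order regardless of pandas version.
alphabetical = pd.Series(data).sort_index()

# index= selects and orders labels explicitly; a label that is not
# a key of the dict ('Fri', illustrative) gets a missing value (NaN).
chosen = pd.Series(data, index=['Thurs', 'Wed', 'Fri'])

print(alphabetical)
print(chosen)
```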

A Series can also be created from a sequence of values and a sequence of index labels.


In [19]:
s = pd.Series(np.random.randint(5, 15, 7),
              index=('Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun'),
              name='Temperature')

In [20]:
s.index.name = "Day of the Week"

In [21]:
s


Out[21]:
Day of the Week
Mon                12
Tues               12
Wed                14
Thur                5
Fri                 5
Sat                14
Sun                12
Name: Temperature, dtype: int64

Series as a Dict


In [22]:
s['Tues']


Out[22]:
12

In [23]:
'Mon' in s


Out[23]:
True

In [24]:
'Son' in s


Out[24]:
False
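
A few more dict-style methods may be worth knowing. The sketch below uses a small stand-in Series (the variable name temps and its three values are only for illustration) to show get, keys and items.

```python
import pandas as pd

# Stand-in Series for illustration.
temps = pd.Series([12, 12, 14], index=['Mon', 'Tues', 'Wed'], name='Temperature')

# Like dict.get, Series.get returns a default instead of raising KeyError.
missing = temps.get('Son', default=None)

# keys() returns the index; items() yields (label, value) pairs.
labels = list(temps.keys())
pairs = list(temps.items())

print(missing)
print(labels)
```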

The Series can also be sliced by index label. Unlike a positional slice, a label slice includes its end label.


In [25]:
s['Thur':'Sun']


Out[25]:
Day of the Week
Thur                5
Fri                 5
Sat                14
Sun                12
Name: Temperature, dtype: int64
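
To make the difference between label and position slices explicit, here is a small sketch using the .loc and .iloc accessors. The values are copied from the output above; the variable name temps is only illustrative.

```python
import pandas as pd

days = ['Mon', 'Tues', 'Wed', 'Thur', 'Fri', 'Sat', 'Sun']
temps = pd.Series([12, 12, 14, 5, 5, 14, 12], index=days, name='Temperature')

# Label-based slice: the end label 'Sat' is included.
by_label = temps.loc['Thur':'Sat']

# Position-based slice: the end position 5 is excluded, as with a list.
by_position = temps.iloc[3:5]

print(by_label)
print(by_position)
```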

Series as an ndarray


In [26]:
s.max()


Out[26]:
14

In [27]:
s + 2*s #Vectorized operation


Out[27]:
Day of the Week
Mon                36
Tues               36
Wed                42
Thur               15
Fri                15
Sat                42
Sun                36
Name: Temperature, dtype: int64

In [28]:
s.iloc[1]  # Accessing a value by position; plain s[1] is deprecated on a label index


Out[28]:
12

In [29]:
s[2:5] #Slicing the Series by position


Out[29]:
Day of the Week
Wed                14
Thur                5
Fri                 5
Name: Temperature, dtype: int64

In [33]:
s[:1]


Out[33]:
Day of the Week
Mon                12
Name: Temperature, dtype: int64

In [39]:
s - np.random.randint(5, 15, 7)


Out[39]:
Day of the Week
Mon                7
Tues               0
Wed                9
Thur              -1
Fri                0
Sat                0
Sun                5
Name: Temperature, dtype: int64
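
Subtracting a NumPy array, as above, pairs values purely by position. When both operands are Series, pandas instead aligns them by index label before computing. A minimal sketch with two small made-up Series (a and b are illustrative names and values):

```python
import pandas as pd

a = pd.Series([12, 14, 5], index=['Mon', 'Tues', 'Wed'])
b = pd.Series([1, 2, 3], index=['Tues', 'Wed', 'Thur'])

# Values are matched by label, not by position. Labels present in only
# one operand ('Mon', 'Thur') produce NaN in the result.
diff = a - b

print(diff)
```

Because of the NaN entries, the result is stored as floating point even though both inputs hold integers.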

In [42]:
for x in s: print(x)  # iterating over values


12
12
14
5
5
14
12

In [43]:
for pos, value in enumerate(s): print(pos, ':', value)


0 : 12
1 : 12
2 : 14
3 : 5
4 : 5
5 : 14
6 : 12

In [44]:
for key, value in s.items(): print(key, ':', value)


Mon : 12
Tues : 12
Wed : 14
Thur : 5
Fri : 5
Sat : 14
Sun : 12

DataFrame

A DataFrame is a two-dimensional, labelled table, and probably the most used data structure in Pandas. Different columns can have different data types, but all the values within a single column share one data type.

A dataframe can be created from

  • python dict
  • csv
  • xls
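
As a rough sketch of the first two sources, the snippet below builds the same small table from a dict and from CSV text. The column names and values are made up for illustration, and io.StringIO stands in for a file on disk. Excel files can be loaded in a similar way with pd.read_excel, which needs an optional engine such as openpyxl installed, so it is not shown here.

```python
import io
import pandas as pd

# From a dict: each key becomes a column, and columns may differ in dtype.
from_dict = pd.DataFrame({
    'city': ['Chennai', 'Mumbai', 'Delhi'],
    'temperature': [29, 19, 5],
})

# From CSV: read_csv accepts a path or any file-like object.
csv_text = "city,temperature\nChennai,29\nMumbai,19\nDelhi,5\n"
from_csv = pd.read_csv(io.StringIO(csv_text))

# One dtype per column: text for 'city', integers for 'temperature'.
print(from_dict.dtypes)
print(from_csv)
```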

Now let us look at the obligatory Day-Temperature example.


In [1]:
import datetime

In [13]:
base = datetime.date.today()
days = 20
# The last `days` calendar dates ending today, reversed so the oldest comes first
date_list = [base - datetime.timedelta(days=x) for x in range(days)]
date_list.reverse()
# One column of random integer temperatures per city
data = {'date': date_list,
        'Chennai': np.random.randint(25, 35, days),
        'Mumbai': np.random.randint(15, 25, days),
        'Delhi': np.random.randint(5, 15, days)}
df = pd.DataFrame(data)

In [14]:
type(df)


Out[14]:
pandas.core.frame.DataFrame

In [15]:
df.head()


Out[15]:
   Chennai  Delhi  Mumbai        date
0       29      5      19  2014-11-02
1       33     11      24  2014-11-03
2       27     14      19  2014-11-04
3       30      9      20  2014-11-05
4       27      5      15  2014-11-06

In [18]:
df = df.set_index('date')

In [19]:
df.head()


Out[19]:
            Chennai  Delhi  Mumbai
date
2014-11-02       29      5      19
2014-11-03       33     11      24
2014-11-04       27     14      19
2014-11-05       30      9      20
2014-11-06       27      5      15
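
An alternative to building a date column and then calling set_index is to generate the index directly. The sketch below assumes pd.date_range is acceptable for this purpose; note that it produces a DatetimeIndex of timestamps rather than the plain datetime.date objects used above.

```python
import numpy as np
import pandas as pd

days = 20

# 20 consecutive daily timestamps ending today (time of day set to midnight).
dates = pd.date_range(end=pd.Timestamp.today().normalize(), periods=days, freq='D')

df = pd.DataFrame({'Chennai': np.random.randint(25, 35, days),
                   'Mumbai': np.random.randint(15, 25, days),
                   'Delhi': np.random.randint(5, 15, days)},
                  index=dates)
df.index.name = 'date'

print(df.head())
```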

In [20]:
df.median()


Out[20]:
Chennai    30.0
Delhi       9.5
Mumbai     19.0
dtype: float64

In [21]:
df.mean()


Out[21]:
Chennai    29.60
Delhi       9.15
Mumbai     19.20
dtype: float64

In [24]:
df.diff().head()


Out[24]:
            Chennai  Delhi  Mumbai
date
2014-11-02      NaN    NaN     NaN
2014-11-03        4      6       5
2014-11-04       -6      3      -5
2014-11-05        3     -5       1
2014-11-06       -3     -4      -5
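
diff() subtracts the previous row from each row, which is why the first row is NaN: it has no predecessor. The sketch below checks this against an explicit shift, using the Chennai and Delhi values from the head() output above.

```python
import pandas as pd

df = pd.DataFrame({'Chennai': [29, 33, 27, 30, 27],
                   'Delhi': [5, 11, 14, 9, 5]})

# shift(1) moves every row down by one, so row i lines up with row i-1.
manual = df - df.shift(1)

print(df.diff())
print(df.diff().equals(manual))
```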

Obligatory CSV Example ;)


In [25]:
titanic = pd.read_csv('data/titanic.csv')

In [33]:
titanic = titanic.set_index('PassengerId')

In [34]:
titanic.head()


Out[34]:
             Survived  Pclass  Name                                               Sex     Age  SibSp  Parch  Ticket              Fare  Cabin  Embarked
PassengerId
1                   0       3  Braund, Mr. Owen Harris                            male     22      1      0  A/5 21171         7.2500  NaN    S
2                   1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  female   38      1      0  PC 17599         71.2833  C85    C
3                   1       3  Heikkinen, Miss. Laina                             female   26      0      0  STON/O2. 3101282  7.9250  NaN    S
4                   1       1  Futrelle, Mrs. Jacques Heath (Lily May Peel)       female   35      1      0  113803           53.1000  C123   S
5                   0       3  Allen, Mr. William Henry                           male     35      0      0  373450            8.0500  NaN    S
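
The read_csv and set_index steps can also be combined with the index_col argument. The sketch below uses an in-memory CSV holding a few columns of the first three rows shown above, since the data/titanic.csv file itself is not reproduced here.

```python
import io
import pandas as pd

# A cut-down stand-in for the CSV file (subset of columns and rows).
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.2833
3,1,3,female,26,7.925
"""

# index_col makes PassengerId the index while reading.
sample = pd.read_csv(io.StringIO(csv_text), index_col='PassengerId')

print(sample)
```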

In [29]:
len(titanic)


Out[29]:
891

In [30]:
titanic.Fare.sum()


Out[30]:
28693.9493

In [31]:
titanic.Survived.value_counts()


Out[31]:
0    549
1    342
dtype: int64

In [35]:
titanic.Pclass.value_counts()


Out[35]:
3    491
1    216
2    184
dtype: int64
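
Counts are often easier to read as proportions. As a sketch, the Series below is rebuilt from the Survived counts shown above (549 zeros and 342 ones) rather than read from the CSV, and value_counts(normalize=True) turns the counts into fractions.

```python
import pandas as pd

# Rebuild the Survived column from the counts reported above.
survived = pd.Series([0] * 549 + [1] * 342, name='Survived')

# normalize=True returns fractions of the total instead of raw counts.
rates = survived.value_counts(normalize=True)

print(rates)
```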

Let's dive in!
